AITopics | video frame

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Agents

Neural Information Processing Systems | Jun-23-2026, 12:33:23 GMT

To address this problem, fine-tuning long-context LVLMs and employing GPT-based agents have emerged as promising solutions. However, fine-tuning LVLMs would require extensive high-quality data and substantial GPU resources, while GPT-based agents would rely on proprietary models (e.g., GPT-4o). In this paper, we propose Video Retrieval-Augmented Generation (Video-RAG), a training-free and cost-effective pipeline that employs visually-aligned auxiliary texts to help facilitate cross-modality alignment while providing additional information beyond the visual content. Specifically, we leverage open-source external tools to extract visually-aligned information from pure video data (e.g., audio, optical character, and object detection), and incorporate the extracted information into an existing LVLM as auxiliary texts, alongside video frames and queries, in a plug-and-play manner. Our Video-RAG offers several key advantages: (i) lightweight with low computing overhead due to single-turn retrieval; (ii) easy implementation and compatibility with any LVLM; and (iii) significant, consistent performance gains across long video understanding benchmarks, including Video-MME, MLVU, and LongVideoBench. Notably, our model demonstrates superior performance over proprietary models like Gemini-1.5-Pro and GPT-4o when utilized with a 72B model.
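The single-turn retrieval step described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the function names (`tokenize`, `cosine`, `retrieve`, `build_prompt`) are hypothetical, the sample ASR/OCR/detection strings are invented, and a bag-of-words cosine similarity stands in for whatever retriever the paper actually uses.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, auxiliary_texts, top_k=2):
    """Single-turn retrieval: rank the auxiliary snippets once against the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), source, text)
              for source, text in auxiliary_texts]
    # Drop snippets that share no tokens with the query.
    scored = [item for item in scored if item[0] > 0.0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(source, text) for _, source, text in scored[:top_k]]


def build_prompt(query, retrieved):
    """Place retrieved auxiliary texts before the query (frames are passed separately)."""
    lines = [f"[{source}] {text}" for source, text in retrieved]
    return "\n".join(lines + [f"Question: {query}"])


# Hypothetical tool outputs; in practice these would come from ASR, OCR,
# and object-detection tools run on the video.
auxiliary_texts = [
    ("ASR", "the chef adds two cups of flour to the bowl"),
    ("OCR", "Step 3: bake at 180 degrees"),
    ("DET", "bowl: 1, whisk: 1, oven: 1"),
]
query = "How many cups of flour does the chef add?"
prompt = build_prompt(query, retrieve(query, auxiliary_texts, top_k=1))
print(prompt)
```

The resulting text prompt would be handed to an LVLM together with the sampled video frames; because the retrieval runs only once per query, no extra model calls are needed.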

arxiv preprint arxiv, large language model, machine learning, (19 more...)

Neural Information Processing Systems

Country: Asia > China (0.46)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Storyboard-guided Alignment for Fine-grained Video Action Recognition

Neural Information Processing Systems | Jun-17-2026, 07:19:32 GMT

Fine-grained video action recognition can be formulated as a video-text matching problem. Previous approaches primarily rely on global video semantics to consolidate video embeddings, often leading to misaligned video-text pairs due to inaccurate atomic-level action understanding. This inaccuracy arises because (i) videos with distinct global semantics may share similar atomic actions or visual appearances, and (ii) atomic actions can be momentary, gradual, or not directly aligned with overarching video semantics. Inspired by storyboarding, where a script is segmented into individual shots, we propose a multi-granularity framework, SFAR. SFAR generates fine-grained descriptions of common atomic actions for each global semantic using a large language model. Unlike existing works that refine global semantics with auxiliary video frames, SFAR introduces a filtering metric to ensure correspondence between the descriptions and the global semantics, eliminating the need for direct video involvement and thereby enabling more nuanced recognition of subtle actions. By leveraging both global semantics and fine-grained descriptions, SFAR effectively identifies prominent frames within videos, thereby improving the accuracy of embedding aggregation. Extensive experiments on various video action recognition datasets demonstrate the competitive performance of SFAR in supervised, few-shot, and zero-shot settings.
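One way to picture "identifying prominent frames" before embedding aggregation is a similarity-weighted average of frame embeddings. The sketch below is only an illustration under stated assumptions: the abstract does not specify the weighting scheme, so the softmax over each frame's best cosine similarity, the `temperature` parameter, and the function names `frame_weights` and `aggregate` are all hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n else list(v)


def frame_weights(frame_embs, text_embs, temperature=0.1):
    """Softmax over each frame's best cosine similarity to any text embedding."""
    texts = [normalize(t) for t in text_embs]
    scores = [max(dot(normalize(f), t) for t in texts) for f in frame_embs]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(frame_embs, weights):
    """Weighted sum of frame embeddings, giving one video-level embedding."""
    dim = len(frame_embs[0])
    return [sum(w * f[i] for w, f in zip(weights, frame_embs)) for i in range(dim)]


# Toy 2-D embeddings: the text embedding stands in for a global label or a
# fine-grained description; real embeddings would come from a text/video encoder.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
texts = [[1.0, 0.0]]
weights = frame_weights(frames, texts)
video_embedding = aggregate(frames, weights)
```

Frames that resemble the text descriptions dominate the aggregated embedding, while unrelated frames contribute almost nothing.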

large language model, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Country: Asia > China (0.28)

Genre: Research Report > Experimental Study (1.00)

Industry:

Leisure & Entertainment > Sports (0.46)
Health & Medicine > Consumer Health (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)

DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding

Neural Information Processing Systems | Jun-15-2026, 18:40:55 GMT

We introduce DanmakuTPPBench, a comprehensive benchmark designed to advance multi-modal Temporal Point Process (TPP) modeling in the era of Large Language Models (LLMs). While TPPs have been widely studied for modeling temporal event sequences, existing datasets are predominantly unimodal, hindering progress in models that require joint reasoning over temporal, textual, and visual information. To address this gap, DanmakuTPPBench comprises two complementary components: (1) DanmakuTPP-Events, a novel dataset derived from the Bilibili video platform, where user-generated bullet comments (Danmaku) naturally form multi-modal events annotated with precise timestamps, rich textual content, and corresponding video frames; (2) DanmakuTPP-QA, a challenging question-answering dataset constructed via a novel multi-agent pipeline powered by state-of-the-art LLMs and multi-modal LLMs (MLLMs), targeting complex temporal-textual-visual reasoning. We conduct extensive evaluations using both classical TPP models and recent MLLMs, revealing significant performance gaps and limitations in current methods' ability to model multi-modal event dynamics. Our benchmark establishes strong baselines and calls for further integration of TPP modeling into the multi-modal language modeling landscape.

large language model, machine learning, natural language, (22 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (0.93)
Banking & Finance (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

26b7e6eeb57bce1005587bd880a80c1f-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems | Jun-15-2026, 18:21:11 GMT

When instructed to place a floor lamp next to an armchair, humans can visually ground it in the scene, estimating its base diameter and height, imagining its precise alignment with the armchair, and judging whether it fits naturally within the 3D environment. Humans can naturally perceive, reason about, and localize expressions to "anywhere" in 3D scenes. Yet can today's 3D vision-language models ground free-form referring expressions to precise positions and dimensions in a 3D scene, especially when those expressions refer to regions beyond objects? Existing 3D visual grounding models, pretrained on large 3D scene datasets, excel at aligning expressions to objects in a scene [7, 58, 2, 63, 61, 26]. However, these models remain constrained to object-level alignment, with limited attention paid to the broader spatial regions beyond objects.

large language model, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (1.00)

Industry: Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.99)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking

Neural Information Processing Systems | Jun-14-2026, 06:30:25 GMT

Despite its popularity in image synthesis, invisible generative watermarking remains largely underexplored in video generation. To address this gap, we propose Safe-Sora, the first framework to embed graphical watermarks directly into the video generation process. Motivated by the observation that watermarking performance is closely tied to the visual similarity between the watermark and cover content, we introduce a hierarchical coarse-to-fine adaptive matching mechanism. Specifically, the watermark image is divided into patches, each assigned to the most visually similar video frame, and further localized to the optimal spatial region for seamless embedding. To enable spatiotemporal fusion of watermark patches across video frames, we develop a 3D wavelet transform-enhanced Mamba architecture with a novel scanning strategy, effectively modeling long-range dependencies during watermark embedding and retrieval. To the best of our knowledge, this is the first attempt to apply state space models to watermarking, opening new avenues for efficient and robust watermark protection. Extensive experiments demonstrate that Safe-Sora achieves state-of-the-art performance in terms of video quality, watermark fidelity, and robustness, which is largely attributed to our proposals. Code and additional supporting materials are provided in the supplementary.
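The coarse stage of the matching mechanism, in which each watermark patch is assigned to the most visually similar frame, can be sketched as a nearest-neighbour search. This is a simplified illustration, not the paper's method: the abstract does not state the similarity measure, so mean absolute pixel difference is an assumption, as are the names `mean_abs_diff` and `assign_patches` and the premise that frames have been downsampled to the patch size.

```python
def mean_abs_diff(a, b):
    """Mean absolute difference between two equal-length pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def assign_patches(patches, frames):
    """For each patch, return the index of the frame with the smallest difference."""
    return [
        min(range(len(frames)), key=lambda i: mean_abs_diff(patch, frames[i]))
        for patch in patches
    ]


# Flattened grayscale values in [0, 1]; frames are assumed to be already
# downsampled to the patch size for this coarse comparison.
patches = [[0.1, 0.1, 0.2, 0.2], [0.9, 0.8, 0.9, 0.8]]
frames = [[0.9, 0.9, 0.9, 0.9], [0.0, 0.1, 0.2, 0.3], [0.5, 0.5, 0.5, 0.5]]
assignment = assign_patches(patches, frames)
print(assignment)
```

The fine stage described in the abstract would then search within the chosen frame for the best spatial region, which this sketch does not attempt.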

artificial intelligence, name change, proceedings, (3 more...)

Neural Information Processing Systems

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Vision (0.59)

Vid-SME: Membership Inference Attacks against Large Video Understanding Models

Neural Information Processing Systems | Jun-13-2026, 14:48:14 GMT

Multimodal large language models (MLLMs) demonstrate remarkable capabilities in handling complex multimodal tasks and are increasingly adopted in video understanding applications. However, their rapid advancement raises serious data privacy concerns, particularly given the potential inclusion of sensitive video content, such as personal recordings and surveillance footage, in their training datasets. Identifying videos that were improperly used during training remains a critical and unresolved challenge. Despite considerable progress on membership inference attacks (MIAs) for text and image data in MLLMs, existing methods fail to generalize effectively to the video domain. These methods suffer from poor scalability as more frames are sampled and generally achieve negligible true positive rates at low false positive rates (TPR@Low FPR), mainly due to their failure to capture the inherent temporal variations of video frames and to account for model behavior differences as the number of frames varies.
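The TPR@Low FPR metric mentioned above can be computed from per-sample membership scores. The sketch below assumes that a higher score means "more likely a training member" and that a sample is flagged when its score reaches the threshold; the function name `tpr_at_fpr` and the toy scores are illustrative, not taken from the paper.

```python
def tpr_at_fpr(member_scores, nonmember_scores, max_fpr=0.05):
    """Best true positive rate among thresholds whose FPR is at most max_fpr.

    A sample is predicted 'member' when its score is >= the threshold.
    """
    candidates = sorted(set(member_scores) | set(nonmember_scores), reverse=True)
    best_tpr = 0.0
    for threshold in candidates:
        false_pos = sum(s >= threshold for s in nonmember_scores)
        true_pos = sum(s >= threshold for s in member_scores)
        if false_pos / len(nonmember_scores) <= max_fpr:
            best_tpr = max(best_tpr, true_pos / len(member_scores))
    return best_tpr


# Toy scores: one non-member (0.85) looks deceptively like a member.
members = [0.9, 0.8, 0.7, 0.4]
nonmembers = [0.85, 0.3, 0.2, 0.1]
print(tpr_at_fpr(members, nonmembers, max_fpr=0.0))
print(tpr_at_fpr(members, nonmembers, max_fpr=0.25))
```

An attack with good average accuracy can still score near zero here, because a single confidently misclassified non-member forces the threshold above most true members.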

artificial intelligence, machine learning, proceedings, (6 more...)

Neural Information Processing Systems

Industry: Information Technology > Security & Privacy (0.96)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

Supplementary Material for "Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations"

Neural Information Processing Systems | Apr-27-2026, 14:19:57 GMT

Potential negative societal impacts: Although our work improves the performance of text-video retrieval, it may also make cross-modal retrieval of sensitive information on the network easier, which may pose challenges for protecting information security. Limitations of our work: Iterative approaches are sensitive to initialization and to parameters such as the dimensions and the number of subspaces. In our work, although we use the L2 normalization operation to limit the value range of the parameters, the EM algorithm [3] may still converge to bad results. The choice of the number of subspaces also has a relatively significant impact on model performance.

artificial intelligence, machine learning, video, (15 more...)

Neural Information Processing Systems

Country:

Asia > China (0.16)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)

Industry: Information Technology > Security & Privacy (0.88)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)

GT, Query-based flow, Optical flow

Neural Information Processing Systems | Apr-25-2026, 07:06:25 GMT

Video Semantic Segmentation (VSS) involves assigning a semantic label to each pixel in a video sequence. Prior work in this field has demonstrated promising results by extending image semantic segmentation models to exploit temporal relationships across video frames; however, these approaches often incur significant computational costs. In this paper, we propose an efficient mask propagation framework for VSS, called MPVSS. Our approach first employs a strong query-based image segmentor on sparse key frames to generate accurate binary masks and class predictions. We then design a flow estimation module utilizing the learned queries to generate a set of segment-aware flow maps, each associated with a mask prediction from the key frame.
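The abstract is cut off before it says how the flow maps are applied, so the following is only a plausible sketch of mask propagation in general: a binary key-frame mask is backward-warped to a non-key frame using its own flow map. The function name `warp_mask`, the flow convention, and the nearest-neighbour sampling are assumptions for illustration, not details confirmed by the source.

```python
def warp_mask(key_mask, flow):
    """Backward-warp a binary key-frame mask to the current frame.

    flow[y][x] = (dx, dy) says that current-frame pixel (x, y) came from
    key-frame pixel (x + dx, y + dy). Sampling is nearest-neighbour, and
    pixels whose source falls outside the frame are set to 0.
    """
    height, width = len(key_mask), len(key_mask[0])
    warped = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            src_x, src_y = round(x + dx), round(y + dy)
            if 0 <= src_x < width and 0 <= src_y < height:
                warped[y][x] = key_mask[src_y][src_x]
    return warped


# A 3x4 key-frame mask with an object in column 1, and a flow map saying
# every pixel's content came from one pixel to the left (object moved right).
key_mask = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
flow = [[(-1, 0)] * 4 for _ in range(3)]
warped = warp_mask(key_mask, flow)
for row in warped:
    print(row)
```

With one flow map per mask, each segment can move independently, which is the intuition behind "segment-aware" flow as opposed to a single flow field shared by the whole frame.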

machine learning, natural language, segmentation, (20 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(3 more...)

Motion Graph Unleashed: A Novel Approach to Video Prediction

Neural Information Processing Systems | Mar-22-2026, 11:35:40 GMT

We introduce the motion graph, a novel approach to the video prediction problem, i.e., predicting future video frames from limited past data. The motion graph transforms patches of video frames into interconnected graph nodes to comprehensively describe the spatial-temporal relationships among them. This representation overcomes the limitations of existing motion representations such as image differences, optical flow, and motion matrices, which either fall short in capturing complex motion patterns or suffer from excessive memory consumption. We further present a video prediction pipeline empowered by the motion graph, exhibiting substantial performance improvements and cost reductions. Extensive experiments on various datasets, including UCF Sports, KITTI and Cityscapes, highlight the strong representative ability of the motion graph. Notably, on UCF Sports, our method matches or outperforms the SOTA methods while reducing model size by 78% and GPU memory utilization by 47%.

artificial intelligence, name change, proceedings, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.78)
